Skip to content

Conversation

@aaron-ang
Copy link
Contributor

Changes Made

Related Issues

Closes #4705.

@github-actions github-actions bot added the feat label Jan 31, 2026
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Jan 31, 2026

Greptile Overview

Greptile Summary

Added variance aggregation function (var()) with configurable degrees of freedom (ddof) parameter, resolving #4705. The implementation supports both sample variance (ddof=1, default) and population variance (ddof=0).

Key Changes:

  • Added Expr.var(ddof) method to Python API following existing patterns
  • Created corresponding daft.functions.var() function per custom rule 8b8ae412
  • Implemented distributed variance calculation using algebraic decomposition: var = (E(X²) - E(X)²) * n/(n-ddof)
  • Added SQL support: var, var_samp, var_pop, variance functions
  • Comprehensive test coverage including edge cases (single values, nulls, grouped operations, multiple partitions)

Implementation Notes:

  • Single partition: Uses direct calculation via calculate_variance() in stats utils
  • Multiple partitions: Decomposes into sum(X), sum(X²), and count(X) for distributed aggregation
  • Properly handles edge case where n ≤ ddof by returning null
  • All variance results cast to Float64 type

Confidence Score: 4/5

  • Safe to merge with one minor style issue regarding inline imports
  • Implementation is mathematically sound and follows established patterns from stddev. Comprehensive test coverage validates correctness across edge cases. One style comment about inline imports (custom rule violation present throughout codebase).
  • No files require special attention - all changes follow existing patterns

Important Files Changed

Filename Overview
daft/expressions/expressions.py Added var() method with ddof parameter, following existing patterns
src/daft-core/src/utils/stats.rs Added calculate_variance() function with ddof support and edge case handling
src/daft-core/src/array/ops/variance.rs Implemented DaftVarianceAggable trait for single and grouped variance operations
src/daft-dsl/src/expr/mod.rs Added Var(ExprRef, usize) variant to AggExpr with full integration
src/daft-local-plan/src/agg.rs Implemented distributed variance aggregation using Welford's online algorithm
tests/dataframe/test_variance.py Comprehensive test coverage for variance with different ddof values, partitions, and edge cases

Sequence Diagram

sequenceDiagram
    participant User
    participant Python API
    participant DSL Layer
    participant Local Plan
    participant Core Ops
    participant Stats Utils

    User->>Python API: df.agg(col("x").var(ddof=1))
    Python API->>DSL Layer: expr.var(ddof)
    DSL Layer->>DSL Layer: Create AggExpr::Var(expr, ddof)
    
    alt Single Partition
        DSL Layer->>Core Ops: Series.var(groups=None, ddof)
        Core Ops->>Core Ops: Cast to Float64
        Core Ops->>Stats Utils: calculate_stats(array)
        Stats Utils-->>Core Ops: Stats{sum, count, mean}
        Core Ops->>Stats Utils: calculate_variance(stats, values, ddof)
        Note over Stats Utils: variance = sum_of_squares / (n - ddof)
        Stats Utils-->>Core Ops: variance value
        Core Ops-->>User: Result
    else Multiple Partitions
        DSL Layer->>Local Plan: populate_aggregation_stages
        Note over Local Plan: First Stage (per partition)
        Local Plan->>Core Ops: Sum(X), Sum(X²), Count(X)
        Core Ops-->>Local Plan: Partial results
        Note over Local Plan: Second Stage (combine partitions)
        Local Plan->>Local Plan: Sum(sums), Sum(sq_sums), Sum(counts)
        Note over Local Plan: Final Stage
        Local Plan->>Local Plan: pop_var = (sq_sum/n) - (sum/n)²
        Local Plan->>Local Plan: sample_var = pop_var * n/(n-ddof)
        Local Plan->>Local Plan: if n <= ddof then null else sample_var
        Local Plan-->>User: Result
    end
Loading

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

6 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Tip: See Also
[`daft.functions.var`](https://docs.daft.ai/en/stable/api/functions/var/)
"""
from daft.functions import var
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inline import violates custom rule 430ffc3f: imports should be at the top of the file. While this pattern is used consistently across the codebase (e.g., stddev() method), it still conflicts with the stated guideline.

Suggested change
from daft.functions import var
return var(self, ddof)

Move the import to the top of the file with other imports.

Context Used: Rule from dashboard - Import statements should be placed at the top of the file rather than inline within functions or met... (source)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: daft/expressions/expressions.py
Line: 1001:1001

Comment:
Inline import violates custom rule 430ffc3f: imports should be at the top of the file. While this pattern is used consistently across the codebase (e.g., `stddev()` method), it still conflicts with the stated guideline.

```suggestion
        return var(self, ddof)
```

Move the import to the top of the file with other imports.

**Context Used:** Rule from `dashboard` - Import statements should be placed at the top of the file rather than inline within functions or met... ([source](https://app.greptile.com/review/custom-context?memory=430ffc3f-245c-4a7f-8039-aba31c0ed558))

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

@codecov
Copy link

codecov bot commented Jan 31, 2026

Codecov Report

❌ Patch coverage is 10.18519% with 97 lines in your changes missing coverage. Please review.
✅ Project coverage is 43.31%. Comparing base (7bc6c81) to head (e28af4e).

Files with missing lines Patch % Lines
src/daft-core/src/array/ops/variance.rs 0.00% 17 Missing ⚠️
src/daft-core/src/series/ops/agg.rs 0.00% 17 Missing ⚠️
src/daft-local-plan/src/agg.rs 0.00% 17 Missing ⚠️
src/daft-dsl/src/expr/mod.rs 0.00% 15 Missing ⚠️
src/daft-core/src/datatypes/agg_ops.rs 0.00% 10 Missing ⚠️
src/daft-core/src/utils/stats.rs 0.00% 10 Missing ⚠️
src/daft-logical-plan/src/ops/project.rs 0.00% 3 Missing ⚠️
src/daft-recordbatch/src/lib.rs 0.00% 3 Missing ⚠️
src/daft-sql/src/modules/aggs.rs 66.66% 3 Missing ⚠️
src/daft-dsl/src/expr/agg.rs 0.00% 2 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #6105      +/-   ##
==========================================
- Coverage   43.35%   43.31%   -0.05%     
==========================================
  Files         917      918       +1     
  Lines      113004   113102      +98     
==========================================
- Hits        48996    48987       -9     
- Misses      64008    64115     +107     
Files with missing lines Coverage Δ
daft/expressions/expressions.py 95.22% <100.00%> (+0.01%) ⬆️
daft/functions/__init__.py 100.00% <ø> (ø)
daft/functions/agg.py 97.72% <100.00%> (+0.10%) ⬆️
src/daft-core/src/datatypes/mod.rs 17.07% <ø> (ø)
src/daft-dsl/src/expr/agg.rs 0.00% <0.00%> (ø)
src/daft-logical-plan/src/ops/project.rs 45.79% <0.00%> (-0.29%) ⬇️
src/daft-recordbatch/src/lib.rs 21.57% <0.00%> (-0.05%) ⬇️
src/daft-sql/src/modules/aggs.rs 38.88% <66.66%> (+1.60%) ⬆️
src/daft-core/src/datatypes/agg_ops.rs 10.95% <0.00%> (-1.74%) ⬇️
src/daft-core/src/utils/stats.rs 0.00% <0.00%> (ø)
... and 4 more

... and 5 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@srilman
Copy link
Contributor

srilman commented Jan 31, 2026

@aaron-ang if you don't mind, in the future could you ask to work on the issue? I ask because someone else was already assigned to the issue. But otherwise, great work, will TAL!

@aaron-ang
Copy link
Contributor Author

@aaron-ang if you don't mind, in the future could you ask to work on the issue? I ask because someone else was already assigned to the issue. But otherwise, great work, will TAL!

ok, I will choose to work on unassigned issues in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Expression.var

2 participants